Add FP4 tile + vector types and memory operations#54
Open
ccs1112 wants to merge 3 commits into
Open
Conversation
Introduce a packing<fp4e2m1_2> specialization (num()=1, self-referential unpacked_type) and modify packing<fp4e2m1_4>'s num() from 4 to 2 with unpacked_type = fp4e2m1_2. HipKittens' rt_shape is type-agnostic, so num() must count unpacked-type units per packed-type element for the packed_per_thread arithmetic to hold the bf16/half/fp8 pattern. Update the T1 concept to include fp4e2m1_2 in place of fp4e2m1. Allow fp4e2m1_4 in rt_base's dtype allowlist. Relax st.cuh's num()==1 assertion to allow fp4e2m1_2, and reject the scalar fp4e2m1 in st and gl with messages pointing users to the packed types. Add sv_fp4e2m1_2 and rt_fp4e2m1_2 aliases and fp4e2m1_2 <-> float2 convertors.
shared_to_register.cuh: new fp4e2m1_4 branch in the row-layout load at RT::base_tile_stride == 16 (rt_16x128), mirroring the existing fp8e4m3_4 branch in width (ds_read_b128 with float4 cast). Handles both ST>=RT and ST<=RT base-tile configurations. global_to_register.cuh: add fp4e2m1_2 to the existing fp8e4m3 rejection in all three static_asserts (row-load, col-load, row-store). Direct g->r isn't used for packed sub-byte types. Also reject fp4e2m1_2 in shared_to_register.cuh's register->shared store path (not in PR 1's scope), and reject the scalar fp4e2m1 in the load path to backstop the higher-level guards in st.cuh and gl.cuh.
testing_utils.cuh: pair-granularity initialize and validate branches for fp4e2m1_2. i_ref/o_ref are sized in pair units; each pair is packed as (f, f) so both halves dequantize to the same value. Tolerance sized to FP4's 16-point grid (absolute 0.5). fp4_load.cu (new): hand-rolled kernel that loads packed FP4 global -> shared -> register and dumps each thread's 32 fp4e2m1_4 elements as 128 floats. Host sorts both sides and does multiset comparison with FP4-grid tolerance. Exercises the new shared_to_register fp4e2m1_4 branch that the existing sharedreg_load_store round-trip can't reach (register -> shared FP4 store isn't in HazyResearch#47's scope). global_to_shared.cu: extend the type sweep with fp4e2m1_2, exercising the g->s and s->g paths for packed FP4 at every supported shape. testing_flags.cuh: wire TEST_WARP_MEMORY_TILE_FP4_LOAD into the TEST_ALL_WARP_MEMORY_TILE expansion and the TEST_WARP_MEMORY_TILE derivation. tile.cu/cuh dispatches the new test. Makefile adds the flag to the default build.
Collaborator
|
Thanks! Can we actually work on FP4 in a branch together? I think it'll keep the main branch more stable as we go about things. |
Author
|
Sounds good, looks like there's no existing FP4 branch on this upstream. Do you want to make one and I'll retarget the PR? I don't have write access on |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR closes #46 and #47. It lands
fp4e2m1_2as the canonical sub-byte tile dtype and adds the three load paths from #47.FP4 breaks the
sizeof(scalar) × packing::num() == sizeof(packed)invariant, so shared storage has to pick between packed and unpacked. Unpacked halves the shared budget and forces ALU nibble-packing on everyshared → registerload. A newst<fp4e2m1, ...>specialization would be perf-optimal but rewrites a core struct for one dtype. Usingfp4e2m1_2as the dtype, which I gleaned from ThunderKittens, has the same performance and costs only one relaxedstatic_assert, so I took this route.rt_shapeis type-agnostic, sonum()has to count unpacked-type units per packed-type: 2 for FP4. The physical register layout is the same.rt_shapetype params can be added as a follow up if so desired.The crux is the new
fp4e2m1_4branch inshared_to_register.cuh. It's basically thefp8e4m3_4branch copied with the new types, aliases and some compiler assertions. I don't thinkglobal_to_shared.cuhneeds any changes because the existing byte math handlessizeof(dtype)==1when the columns are in pairs.I wrote a test file with some caveats:
GPU_TARGET=CDNA4 make -j13compiles clean on MI300X (Hot Aisle,rocm/7.0-previewcontainer)GPU_TARGET=CDNA3. I know main targets CDNA4 so if v3 support has been dropped then this isn't a bug, otherwise the fix for this can be filed separately:Cannot select: intrinsic llvm.amdgcn.raw.buffer.load.lds